Cellen 2

Data wrangling V — joins

Gavin Simpson

Aarhus University

2025-02-17

Learning objectives

In this video of the data wrangling topic you will

  • Learn about relational data

Relational Data

New York Flights

nycflights13 contains five tables

  1. flights
  2. airlines
  3. airports
  4. planes
  5. weather

New York Flights

flights
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

New York Flights

airlines
## # A tibble: 16 × 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

New York Flights

airports
## # A tibble: 1,458 × 8
##    faa   name                             lat    lon   alt    tz dst   tzone    
##    <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
##  1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
##  2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
##  3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
##  4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
##  5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
##  6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
##  7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
##  8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
##  9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
## 10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
## # ℹ 1,448 more rows

New York Flights

planes
## # A tibble: 3,322 × 9
##    tailnum  year type              manufacturer model engines seats speed engine
##    <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
##  2 N102UW   1998 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  3 N103US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  4 N104UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  5 N10575   2002 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
##  6 N105UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  7 N107US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  8 N108UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  9 N109UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
## 10 N110UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
## # ℹ 3,312 more rows

New York Flights

weather
## # A tibble: 26,115 × 15
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
## # ℹ 26,105 more rows
## # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>

New York Flights

Source: Wickham & Grolemund R for Data Science

Keys

Variables that connect each pair of tables are called keys

Each plane is uniquely identified by the tailnum

Two types of key

  1. Primary key — uniquely identifies an observation in it’s own table planes$tailnum
  2. Foreign key — uniquely identifies an observation in another table flights$tailnum

Surrogate keys are keys that are created when a table doesn’t have one or more variables that uniquely identify an observation

mutate(tbl, id = row_number())

Elephants

Animal science Data are regularly stored this way

A study on behaviour in elephants used three tables

  1. qualitaive behavioural analysis
  2. daytime behaviour observations
  3. nighttime behaviour observations

Elephants

glimpse(elephant_day)
## Rows: 84
## Columns: 18
## $ elephant                  <chr> "Elephant1F", "Elephant1F", "Elephant1F", "E…
## $ observer                  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ observation               <chr> "Obs1", "Obs2", "Obs3", "Obs4", "Obs5", "Obs…
## $ stereotypy                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ wallowing                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ feeding                   <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 4, 1, 2,…
## $ foraging                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0,…
## $ feedforage                <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 6, 1, 2,…
## $ comfort                   <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 3, 0, 1, 0, 0, 0, 0,…
## $ comfortwallow             <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 3, 0, 1, 0, 0, 0, 0,…
## $ environmental_interaction <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 2, 2, 0, 0, 2, 0, 1,…
## $ affiliative               <dbl> 1, 2, 1, 1, 1, 1, 1, 2, 1, 5, 3, 1, 0, 0, 0,…
## $ agonistic                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ anticipating              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ locomotion                <dbl> 2, 0, 1, 1, 0, 1, 0, 0, 1, 3, 2, 0, 2, 0, 1,…
## $ sleeprest                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0,…
## $ stereoantic               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ affilagonostic            <dbl> 1, 2, 1, 1, 1, 1, 1, 2, 1, 5, 3, 1, 0, 0, 0,…
glimpse(elephant_qba)
## Rows: 72
## Columns: 15
## $ elephant      <chr> "Elephant1F", "Elephant1F", "Elephant1F", "Elephant1F", …
## $ observer      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ observation   <chr> "Obs1", "Obs2", "Obs3", "Obs4", "Obs5", "Obs6", "Obs7", …
## $ content       <dbl> 8.6, 7.1, 7.0, 6.4, 8.0, 9.1, 8.6, 8.0, 8.4, 7.9, 9.5, 7…
## $ relaxed       <dbl> 8.8, 7.7, 8.0, 7.0, 9.0, 9.5, 9.0, 9.0, 9.7, 8.0, 8.9, 6…
## $ uncomfartable <dbl> 0.9, 0.4, 0.8, 1.3, 0.9, 0.4, 0.9, 0.9, 0.3, 0.5, 0.5, 0…
## $ agitated      <dbl> 0.2, 1.0, 0.8, 0.7, 0.3, 0.3, 0.2, 0.1, 0.5, 1.0, 0.3, 0…
## $ tense         <dbl> 1.5, 3.0, 1.2, 3.2, 0.3, 0.7, 0.2, 1.0, 0.5, 0.3, 0.9, 0…
## $ frustrated    <dbl> 0.7, 2.0, 0.2, 0.4, 0.4, 0.3, 0.3, 0.1, 0.2, 0.4, 0.4, 0…
## $ wary          <dbl> 0.9, 1.4, 2.9, 2.0, 0.7, 1.2, 0.8, 0.9, 1.0, 2.0, 0.8, 4…
## $ playful       <dbl> 0.2, 3.0, 1.5, 0.2, 0.4, 4.1, 1.2, 1.5, 5.7, 0.3, 5.0, 0…
## $ sociable      <dbl> 0.2, 0.1, 0.2, 5.8, 0.2, 0.2, 6.5, 0.1, 9.0, 0.1, 0.1, 0…
## $ lively        <dbl> 3.7, 5.0, 2.1, 4.0, 1.0, 4.0, 3.0, 1.0, 5.6, 3.8, 5.0, 5…
## $ lethargic     <dbl> 0.8, 0.4, 1.1, 1.5, 2.0, 0.8, 0.3, 2.0, 0.1, 0.6, 0.3, 0…
## $ observations  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

Can you identify the keys in these tables?

Elephants

We don’t know the names of the elephants here, nor do we have metadata about each animal

But we can make some up:

elephant_meta <- tribble(
  ~elephant, ~name, ~sex, ~age,
  "Elephant1F", "Winifred", "female", 45,
  "Elephant2F", "Ellie", "female", 23,
  "Elephant3M", "Dumbo", "male", 57
)

elephant_observer <- tribble(
  ~observer, ~given, ~surname, ~institution,
  1, "Jan", "Lauridesen", "Aalborg Zoo",
  2, "Line", "Larsen", "Zoologisk Have København",
  3, "Charlotte", "Davidson", "Odense Zoo",
  4, "Else", "Jacobsen", "Odense Zoo",
)

Which column(s) are keys in elephant_meta?

Which column(s) are keys in elephant_observer?